[Bugfix] Fix reasoning end token missed by should_advance under async scheduling + spec decode by oneraghavan · Pull Request #43526 · vllm-project/vllm

oneraghavan · 2026-05-24T13:10:58Z

Purpose

When async scheduling and speculative decoding are both enabled, should_advance() misses the reasoning end token (e.g. </think>) because placeholder arithmetic produces an empty delta window.

Root Cause

should_advance() computes a delta slice of recently-generated tokens via:

delta_from = request.num_computed_tokens - request.num_output_placeholders
delta = all_token_ids[delta_from:]

Under async scheduling + spec decode:

num_computed_tokens is pre-incremented by _update_after_schedule (before GPU runs)
num_computed_tokens is then adjusted down for rejected spec tokens in update_from_output
num_output_placeholders follows a parallel increment/decrement cycle
After token append + placeholder decrement, the arithmetic can produce delta_from == len(all_token_ids), yielding an empty delta

The reasoning end token sits in all_token_ids but the delta window walks past it. reasoning_ended never becomes True, and JSON grammar constraints are never applied.
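The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the vLLM code: `ToyRequest` and `delta_window` are hypothetical stand-ins, and the counter values are assumed purely so that `delta_from` lands at the end of the token list.

```python
from dataclasses import dataclass, field

END_TOKEN = 248069  # </think> in the report quoted in this PR


@dataclass
class ToyRequest:
    """Hypothetical stand-in for the scheduler's per-request state."""
    all_token_ids: list[int] = field(default_factory=list)
    num_computed_tokens: int = 0
    num_output_placeholders: int = 0


def delta_window(req: ToyRequest) -> list[int]:
    # Same arithmetic as the formula quoted in the Root Cause section.
    delta_from = req.num_computed_tokens - req.num_output_placeholders
    return req.all_token_ids[delta_from:]


# Assumed state: the new tokens are already appended and the placeholders
# already decremented, so delta_from equals len(all_token_ids).
req = ToyRequest(
    all_token_ids=[101, 102, 9, 198, END_TOKEN],
    num_computed_tokens=5,
    num_output_placeholders=0,
)

delta = delta_window(req)
end_token_in_history = END_TOKEN in req.all_token_ids
end_token_in_delta = END_TOKEN in delta
```

In this toy state the end token is present in the history but absent from the window that is actually inspected, which is the condition the PR describes.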

Observed in production (from issue #43388):

new_token_ids = [9, 198, 248069, 271]   # 248069 = </think>
start = 5975                              # one past the end token at index 5974
delta = []                                # empty!

Fix

Add an optional new_token_ids parameter to should_advance(). At the token-output call site in update_from_output (the only path where new_token_ids is naturally available), pass the actual tokens directly. This bypasses the fragile placeholder arithmetic entirely.

The fallback path (draft-token call sites in update_draft_token_ids / update_draft_token_ids_in_output) is unchanged — those rely on reasoning_ended already being set by the prior update_from_output call.
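A rough sketch of the idea follows, assuming a simplified request object. `ToyRequest` and `detect_reasoning_end` are hypothetical names, and the return-value semantics of the real should_advance() (including the structural-tag case) are not modeled here; only the choice of delta source is shown.

```python
from dataclasses import dataclass, field
from typing import Optional, Sequence

END_TOKEN = 248069  # </think> in the report quoted in this PR


@dataclass
class ToyRequest:
    """Hypothetical stand-in for the scheduler's per-request state."""
    all_token_ids: list[int] = field(default_factory=list)
    num_computed_tokens: int = 0
    num_output_placeholders: int = 0
    reasoning_ended: bool = False


def detect_reasoning_end(
    req: ToyRequest, new_token_ids: Optional[Sequence[int]] = None
) -> bool:
    """Set and return req.reasoning_ended based on the most recent tokens."""
    if req.reasoning_ended:
        return True
    if new_token_ids is not None:
        # Token-output path: the caller already holds the real new tokens.
        delta = list(new_token_ids)
    else:
        # Fallback path: re-derive the window from the counters.
        delta_from = req.num_computed_tokens - req.num_output_placeholders
        delta = req.all_token_ids[delta_from:]
    if END_TOKEN in delta:
        req.reasoning_ended = True
    return req.reasoning_ended


# Assumed counters that put delta_from at the end of the list.
req = ToyRequest(
    all_token_ids=[101, 102, 9, 198, END_TOKEN, 271],
    num_computed_tokens=6,
    num_output_placeholders=0,
)
missed = detect_reasoning_end(req)                           # counter-derived window
found = detect_reasoning_end(req, [9, 198, END_TOKEN, 271])  # explicit tokens
```

Keeping the parameter optional leaves the draft-token call sites untouched, which matches the stated intent of the fallback path.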

Changes

| File | Change |
| --- | --- |
| `vllm/v1/structured_output/__init__.py` | Add `new_token_ids` param to `should_advance()`; use it as delta when provided |
| `vllm/v1/core/sched/scheduler.py` | Pass `new_token_ids` at the `update_from_output` call site |
| `tests/v1/structured_output/test_reasoning_structured_output.py` | 4 new unit tests covering the bug and fix |
| `tests/entrypoints/llm/test_struct_output_generate.py` | New E2E matrix entry: Qwen3 + spec decode + async scheduling |

Test Plan

Unit tests (4 new, all pass locally):

test_should_advance_with_new_token_ids_detects_reasoning_end — end token found via new_token_ids regardless of placeholder state
test_should_advance_async_spec_decode_empty_delta_misses_end_token — reproduces the exact bug (empty delta), then shows new_token_ids fixes it
test_should_advance_new_token_ids_structural_tag_spec_decode — structural tag + spec decode same-step True (no regression)
test_should_advance_new_token_ids_no_end_token — negative case: no end token → reasoning_ended stays False

E2E test (new matrix entry):

Qwen/Qwen3-1.7B + qwen3 reasoning parser + ngram spec decode + async_scheduling=True — the exact combination that triggers the bug

Existing tests (all 13 pass):

tests/v1/structured_output/test_reasoning_structured_output.py — 13 passed

Related Work

Issue #34650, "Bug: Speculative Decoding (MTP) Causes </think> Detection Failure in Structured Output + Reasoning Mode": identical bug reported for DeepSeek-R1 + MTP (Feb 2026).
PR #36138, "[Bugfix] Grammar was ignored when reasoning ended within speculated tokens": comprehensive fix (open since Mar 2026, needs-rebase). That PR adds identify_constrained_draft_tokens() / _find_reasoning_end_in_tokens() across 3 files (+403/-95). This PR achieves the same detection fix with +25/-11 lines in 2 files.
Issue #43338, "[Bug] PR #36138 grammar-mask spec-decode fix doesn't handle multi-token reasoning boundaries (gpt-oss/openai_gptoss still bleeds; Qwen3 fixed)": related multi-token boundary variant (gpt-oss). This PR fixes single-token boundaries (Qwen3, DeepSeek-R1); multi-token markers need the prior-context approach from #43424, "[V1][Bugfix] structured_output × spec-decode: pre-commit grammar filter for boundary-step bonus tokens".
PR #31332, "[BugFix] Fix async scheduling + reasoning with struct output": the original async scheduling fix by @njhill that introduced delta_from = num_computed_tokens - num_output_placeholders. That formula is correct for async scheduling without spec decode but breaks when spec decode rejection adjustments shift the counters.

gemini-code-assist

Code Review

This pull request improves the reliability of reasoning end token detection in structured output by passing new_token_ids directly to the should_advance method. This update avoids potential issues with placeholder arithmetic during asynchronous scheduling and speculative decoding. The changes include new unit tests covering various scenarios and an additional test case for Qwen3 models. I have no further feedback to provide as there were no review comments.

… scheduling + spec decode

Fixes vllm-project#43388, vllm-project#34650

When async scheduling and speculative decoding are both enabled, should_advance() reconstructs the new tokens delta via placeholder arithmetic (num_computed_tokens - num_output_placeholders). After spec decode rejection adjustments and token appending, this arithmetic can produce start == len(all_token_ids), yielding an empty delta that misses the reasoning end token (e.g. </think>). As a result, reasoning_ended never flips to True and JSON grammar constraints are never applied.

Fix: add an optional new_token_ids parameter to should_advance(). When the caller has the actual new tokens (the token-output path in update_from_output), pass them directly so the method checks them instead of re-deriving the delta from counter arithmetic. The fallback path (draft-token call sites) is unchanged.

Signed-off-by: Raghavan <oneraghavan@gmail.com>

huanghuan3 · 2026-05-25T03:07:14Z

new_token_ids=[248069, 271, 71093]
248069 = </think>
71093 may correspond to the beginning of ```json {} ```

After changing the delta window to include the full new_token_ids batch, vLLM can correctly set reasoning_ended=True. However, any tokens generated after </think> in that same batch were still sampled before the JSON grammar constraint became active.

This means the fix can make apply=True from the following step onward, but it cannot retroactively constrain, reject, or remove the already-generated post-thinking tokens. If those tokens include ``` or json, the final content may still contain Markdown fences.

There is also a related grammar state synchronization issue. If the first JSON token after </think> has already been generated in the same speculative batch but was not accepted by the grammar FSM, the next constrained step may still expect the initial {, causing duplicated opening braces such as:

{{

How should this case be handled?
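The duplicated-brace scenario can be illustrated with a toy model. This is only a sketch under strong assumptions: `ToyGrammar` and `simulate` are hypothetical, tokens are plain strings rather than token ids, and the "grammar" merely forces an opening `{` as its first token.

```python
from typing import Optional


class ToyGrammar:
    """Hypothetical grammar state: forces "{" first, then allows anything."""

    def __init__(self) -> None:
        self.consumed: list[str] = []

    def forced_next(self) -> Optional[str]:
        return "{" if not self.consumed else None

    def accept(self, token: str) -> None:
        self.consumed.append(token)


def simulate(post_think_batch: list[str], sync_grammar: bool) -> str:
    """Emit an unconstrained post-</think> batch, then one constrained step."""
    grammar = ToyGrammar()
    output = list(post_think_batch)  # already sampled without constraints
    if sync_grammar:
        # Feed the already-emitted tokens to the grammar before continuing.
        for token in post_think_batch:
            grammar.accept(token)
    forced = grammar.forced_next()
    if forced is not None:
        output.append(forced)
        grammar.accept(forced)
    return "".join(output)


desynced = simulate(["{"], sync_grammar=False)
synced = simulate(["{"], sync_grammar=True)
```

When the grammar state never sees the `{` that was emitted in the same batch as the end marker, the next constrained step emits a second one, which is the `{{` output described in the comment above.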

oneraghavan · 2026-05-25T09:03:27Z

Closing this PR in favor of #43424, which takes a more comprehensive approach.

Our fix here addresses the detection side (passing new_token_ids directly to should_advance to avoid the fragile placeholder arithmetic), but it doesn't handle the case where </think> lands mid-batch in a speculative decode step — the post-reasoning tokens in the same batch escape grammar validation entirely, as correctly pointed out by @huanghuan3.

#43424 solves this with a pre-commit filter (precommit_filter_tokens) that intercepts the token batch before _update_request_with_output, splits at the reasoning boundary, and validates/truncates post-boundary tokens. This approach:

Handles both single-token and multi-token reasoning markers
Composes cleanly with the existing should_advance deferral logic
Has been production-tested on multiple model/spec-decode configurations
Is additive (+144 lines, 0 deletions) with no existing API changes
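For readers unfamiliar with the approach, the split-then-validate step can be sketched as below. This is not the precommit_filter_tokens implementation from #43424: the function name, signature, and the `accepts` callback are hypothetical, only single-token markers are handled, and the validator is a stand-in for a real grammar matcher.

```python
from typing import Callable, Sequence

END_TOKEN = 248069  # </think> in the reports quoted in this PR


def precommit_filter_sketch(
    new_token_ids: Sequence[int],
    end_token: int,
    accepts: Callable[[int], bool],
) -> list[int]:
    """Keep tokens up to the end marker, then keep post-boundary tokens
    only while the (hypothetical) grammar validator accepts them."""
    tokens = list(new_token_ids)
    if end_token not in tokens:
        return tokens  # still reasoning: nothing to validate
    boundary = tokens.index(end_token) + 1
    kept = tokens[:boundary]
    for token in tokens[boundary:]:
        if not accepts(token):
            break  # truncate at the first token the grammar rejects
        kept.append(token)
    return kept


# Stand-in validator: rejects 71093, assumed here to start a Markdown fence.
def toy_accepts(token: int) -> bool:
    return token != 71093


filtered = precommit_filter_sketch([END_TOKEN, 271, 71093], END_TOKEN, toy_accepts)
untouched = precommit_filter_sketch([9, 198], END_TOKEN, toy_accepts)
```

Running the filter before the tokens are committed to the request means rejected post-boundary tokens never reach the output, rather than being constrained only from the following step.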

Credit to @sfbemerk for the original analysis in #36138 that charted the path for all of these fixes.

oneraghavan requested review from ApostaC, WoosukKwon, aarnphm, alexm-redhat, benchislett, heheda12345, mgoin, njhill, orozery, robertgshaw2-redhat, russellb and ywang96 as code owners May 24, 2026 13:10

mergify Bot added the structured-output, v1, and bug labels May 24, 2026

github-project-automation Bot added this to Structured Output May 24, 2026

gemini-code-assist Bot reviewed May 24, 2026

oneraghavan force-pushed the fix/structured-output-reasoning-end-spec-decode branch from 9683efb to 14c6e4d Compare May 24, 2026 13:14

oneraghavan force-pushed the fix/structured-output-reasoning-end-spec-decode branch from 14c6e4d to b5bb916 Compare May 24, 2026 13:18

oneraghavan closed this May 25, 2026

github-project-automation Bot moved this to Done in Structured Output May 25, 2026

oneraghavan wants to merge 1 commit into vllm-project:main from oneraghavan:fix/structured-output-reasoning-end-spec-decode
